evaluation practice
Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Chehbouni, Khaoula, Haddou, Mohammed, Cheung, Jackie Chi Kit, Farnadi, Golnoosh
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible practices in LLJ evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
- Europe > Austria > Vienna (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.05)
- (18 more...)
- Health & Medicine (0.46)
- Law (0.46)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Prandi, Matteo, Suriani, Vincenzo, Pierucci, Federico, Galisai, Marcello, Nardi, Daniele, Bisconti, Piercosma
The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability", while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- North America > Canada > Newfoundland and Labrador > Labrador (0.04)
- Europe > Monaco (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.87)
- Government (1.00)
- Law > Statutes (0.87)
- Information Technology > Security & Privacy (0.66)
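As a rough illustration of the coverage-mapping step the Bench-2-CoP abstract describes, the sketch below labels each benchmark question with an LLM judge and tallies coverage per taxonomy category. The taxonomy labels, prompt wording, and the call_llm_judge stub are assumptions for illustration only, not the authors' pipeline or the full EU AI Act taxonomy.

```python
from collections import Counter

# Illustrative labels; the EU AI Act / CoP taxonomy used in the paper is far richer.
TAXONOMY = [
    "Tendency to hallucinate",
    "Lack of performance reliability",
    "Evading human oversight",
    "Self-replication",
    "Autonomous AI development",
    "Not regulatory-relevant",
]

JUDGE_PROMPT = (
    "You are auditing an AI benchmark for regulatory coverage.\n"
    "Assign the question below to exactly one category from this list:\n"
    + "\n".join(f"- {label}" for label in TAXONOMY)
    + "\n\nQuestion: {question}\nAnswer with the category name only."
)

def call_llm_judge(prompt: str) -> str:
    """Placeholder: wire this up to whichever LLM client is available."""
    raise NotImplementedError

def map_benchmark_to_taxonomy(questions: list[str]) -> Counter:
    """Label every benchmark question and tally coverage per category."""
    coverage = Counter()
    for question in questions:
        label = call_llm_judge(JUDGE_PROMPT.format(question=question)).strip()
        # Bucket any judge output that falls outside the taxonomy.
        coverage[label if label in TAXONOMY else "Unparsed judge output"] += 1
    return coverage

def coverage_report(coverage: Counter) -> None:
    """Print the share of regulatory-relevant questions per category."""
    total = sum(coverage.values())
    for label, count in coverage.most_common():
        print(f"{label}: {count} questions ({100 * count / total:.1f}%)")
```

The report step corresponds to the kind of per-category percentages the abstract cites (e.g., the share of questions devoted to hallucination versus loss-of-control capabilities).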
SPHERE: An Evaluation Card for Human-AI Systems
Ma, Qianou, Zhao, Dora, Zhao, Xinran, Si, Chenglei, Yang, Chenyang, Louie, Ryan, Reiter, Ehud, Yang, Diyi, Wu, Tongshuang
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of design options for human-AI system evaluation, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is the evaluation conducted?; 5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and provide three recommendations for improving the validity and rigor of evaluation practices.
- North America > United States > Hawaii > Honolulu County > Honolulu (0.05)
- Asia > Singapore (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- (17 more...)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (0.94)
- Health & Medicine (1.00)
- Education (1.00)
- Media (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
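The five SPHERE dimensions lend themselves to a structured record. The minimal sketch below assumes field names paraphrased from the abstract's five questions and an invented example entry; it is not the card's published schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class SphereCard:
    """One evaluation-card entry, with a field per SPHERE dimension."""
    what_is_evaluated: str   # 1) What is being evaluated?
    how_conducted: str       # 2) How is the evaluation conducted?
    who_participates: str    # 3) Who is participating in the evaluation?
    when_conducted: str      # 4) When is the evaluation conducted?
    how_validated: str       # 5) How is the evaluation validated?

# Hypothetical example entry, for illustration only.
card = SphereCard(
    what_is_evaluated="Task success and user trust in a writing assistant",
    how_conducted="Controlled lab study with Likert-scale questionnaires",
    who_participates="24 recruited crowdworkers",
    when_conducted="Summative evaluation after the final prototype",
    how_validated="Inter-rater agreement plus a pilot study",
)
print(json.dumps(asdict(card), indent=2))
```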
Scenario-based Evaluation of Prediction Models for Automated Vehicles
Sánchez, Manuel Muñoz, Elfring, Jos, Silvas, Emilia, van de Molengraft, René
To operate safely, an automated vehicle (AV) must anticipate how the environment around it will evolve. For that purpose, it is important to know which prediction models are most appropriate for every situation. Currently, prediction models are often assessed over a set of trajectories without distinguishing the type of movement they capture, making it impossible to determine each model's suitability for different situations. In this work we illustrate how such standardized evaluation methods can lead to incorrect conclusions about a model's predictive capabilities, preventing a clear assessment of prediction models and potentially leading to dangerous on-road situations. We argue that, following established practices in AV safety assessment, prediction models should be evaluated in a scenario-based fashion. To encourage scenario-based assessment of prediction models and illustrate the dangers of improper assessment, we categorize trajectories of the Waymo Open Motion dataset according to the type of movement they capture. We then thoroughly evaluate three different models across trajectory types and prediction horizons. Results show that common evaluation methods are insufficient and that assessment should depend on the application in which the model will operate.
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
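A minimal sketch of the scenario-based evaluation this abstract argues for: group trajectories by movement type and report displacement error per type and prediction horizon, rather than one aggregate score. The movement-type labels, horizons, and ADE metric below are generic assumptions, not the paper's exact protocol for the Waymo Open Motion dataset.

```python
from collections import defaultdict
import numpy as np

def average_displacement_error(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth positions."""
    return float(np.linalg.norm(pred - truth, axis=-1).mean())

def evaluate_by_scenario(samples, horizons=(10, 30, 50)):
    """samples: iterable of (movement_type, predicted_xy[T, 2], ground_truth_xy[T, 2]).

    Returns ADE broken down by movement type and horizon (in time steps),
    instead of a single number averaged over all trajectories.
    """
    errors = defaultdict(list)
    for movement_type, pred, truth in samples:
        for horizon in horizons:
            h = min(horizon, len(truth))
            errors[(movement_type, horizon)].append(
                average_displacement_error(pred[:h], truth[:h])
            )
    return {key: float(np.mean(vals)) for key, vals in errors.items()}

# Synthetic straight vs. turning trajectories, purely to show the output shape.
rng = np.random.default_rng(0)
samples = [
    ("straight", rng.normal(size=(50, 2)), rng.normal(size=(50, 2))),
    ("turn_left", rng.normal(size=(50, 2)), rng.normal(size=(50, 2))),
]
for (movement_type, horizon), ade in sorted(evaluate_by_scenario(samples).items()):
    print(f"{movement_type:>10s} @ {horizon:2d} steps: ADE = {ade:.2f}")
```

Reporting the per-scenario breakdown makes it visible when a model that looks strong in aggregate fails on a specific movement type or at longer horizons.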
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
Hutiri, Wiebke Toussaint, Gorce, Lauriane, Ding, Aaron Yi
Speaker verification (SV) provides billions of voice-enabled devices with access control and ensures the security of voice-driven technologies. As a type of biometrics, SV must be unbiased, with consistent and reliable performance across speakers irrespective of their demographic, social, and economic attributes. Current SV evaluation practices are insufficient for evaluating bias: they are over-simplified, aggregate users, are not representative of real-life usage scenarios, and do not account for the consequences of errors. This paper proposes design guidelines for constructing SV evaluation datasets that address these shortcomings. We propose a schema for grading the difficulty of utterance pairs and present an algorithm for generating inclusive SV datasets. We empirically validate our proposed method in a set of experiments on the VoxCeleb1 dataset. Our results confirm that the number of utterance pairs per speaker and the difficulty grading of utterance pairs have a significant effect on evaluation performance and variability. Our work contributes to the development of SV evaluation practices that are inclusive and fair.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada (0.05)
- Asia > India (0.05)
- (9 more...)
- Law (0.46)
- Information Technology > Security & Privacy (0.34)
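A hedged sketch of difficulty-graded trial generation in the spirit of this abstract: pairs are graded by speaker identity and shared attributes, and an equal number of trials is sampled per speaker. The attribute names, grading rules, and pairs_per_speaker parameter are illustrative assumptions, not the paper's schema or algorithm.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    speaker_id: str
    gender: str       # illustrative attributes; the paper's schema grades
    nationality: str  # difficulty along more dimensions than these two
    path: str

def pair_difficulty(a: Utterance, b: Utterance) -> str:
    """Grade a trial pair: different-speaker pairs that share attributes are
    harder impostor trials than pairs that differ on every attribute."""
    if a.speaker_id == b.speaker_id:
        return "target"
    if a.gender == b.gender and a.nationality == b.nationality:
        return "hard_impostor"
    if a.gender == b.gender:
        return "medium_impostor"
    return "easy_impostor"

def build_trials(utterances, pairs_per_speaker=20, seed=0):
    """Sample the same number of graded trial pairs for every speaker,
    so that no speaker dominates the aggregate error rates."""
    rng = random.Random(seed)
    by_speaker = {}
    for u in utterances:
        by_speaker.setdefault(u.speaker_id, []).append(u)
    trials = []
    for speaker, own in by_speaker.items():
        others = [u for u in utterances if u.speaker_id != speaker]
        candidates = list(itertools.product(own, others)) + [
            (a, b) for a, b in itertools.combinations(own, 2)
        ]
        for a, b in rng.sample(candidates, min(pairs_per_speaker, len(candidates))):
            trials.append((a.path, b.path, pair_difficulty(a, b)))
    return trials
```

Evaluating error rates separately per difficulty grade, rather than over the pooled trial list, is what exposes the bias that aggregate scores hide.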
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Gehrmann, Sebastian, Clark, Elizabeth, Sellam, Thibault
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations, and with commonly used datasets in NLG, that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences to assess how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.